
feat(query): RowBinaryWithNamesAndTypes for enhanced type safety #221


Open · wants to merge 21 commits into main

Conversation

@slvrtrn slvrtrn (Contributor) commented May 20, 2025

Summary

Warning

This is a work in progress implementation and may change significantly.
It implements RBWNAT for Query only; Insert should be a new PR.

First of all, let's abbreviate RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB for simplicity.

There is a significant number of issues in the repository regarding schema incompatibility or obscure error messages (see the full list below). The reason is that deserialization is effectively implemented in a "data-driven" way, where the user's structures dictate how the RB stream should be (de)serialized. This makes it possible to have a hiccup where, say, two UInt32 values are deserialized as a single UInt64, which in the worst case leads to corrupted data. For example:

This test deserializes a wrong value on the main branch, because DateTime64 is streamed as 8 bytes (Int64), and 2x (U)Int32 are also streamed as 8 bytes in total. On this branch, with validation mode enabled, it now correctly throws an error.

#[tokio::test]
#[cfg(feature = "time")]
async fn test_serde_with() {
    #[derive(Debug, Row, Serialize, Deserialize, PartialEq)]
    struct Data {
        #[serde(with = "clickhouse::serde::time::datetime64::millis")]
        n1: OffsetDateTime, // underlying is still Int64; should not compose it from two (U)Int32
    }

    let client = prepare_database!().with_struct_validation_mode(StructValidationMode::EachRow);
    let result = client
        .query("SELECT 42 :: UInt32 AS n1, 144 :: Int32 AS n2")
        .fetch_one::<Data>()
        .await;

    assert!(result.is_err());
    assert!(matches!(
        result.unwrap_err(),
        Error::InvalidColumnDataType { .. }
    ));
}

This PR introduces:

  • RBWNAT format usage instead of RB, which allows for stronger type safety guarantees. This is regulated by the StructValidationMode client option, which has two possible modes:
    • First(1) (default) - uses RBWNAT and checks the types for the first row only, so it retains most of the performance compared to the Disabled mode, while still providing significantly stronger guarantees.
    • EachRow - uses RBWNAT and every single row is validated. It is expected to be significantly slower than the default mode.
  • A new internal types crate with utilities for parsing RBWNAT and Native data type strings into a proper AST. Rustified from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts, but not entirely. The most important parts are correctness and the tests; the actual implementation details can be adjusted in a follow-up.
  • The ability to conveniently deserialize a Map as a HashMap<K, V>, and not only as a Vec<(K, V)>, which was confusing (see the sketch after this list).
  • Clearer error messages for schema mismatch.
  • A lot of tests, with more to come, especially for difficult corner cases (nested nullables, multi-dimensional mixed arrays/maps/tuples/enums, etc.).
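
As a quick illustration of how the first and third points above look from the caller's side, here is a minimal sketch. This is not code from the PR: the Map column shape is made up, and it assumes StructValidationMode is re-exported at the crate root.

use std::collections::HashMap;

use clickhouse::{Client, Row, StructValidationMode};
use serde::Deserialize;

#[derive(Debug, Row, Deserialize)]
struct Stats {
    // With RBWNAT, a Map(String, UInt32) column can be read directly into a
    // HashMap instead of a Vec<(String, u32)>.
    counters: HashMap<String, u32>,
}

async fn fetch_stats(client: Client) -> clickhouse::error::Result<Vec<Stats>> {
    // EachRow validates every row against the RBWNAT header;
    // the default mode validates the first row only.
    let client = client.with_struct_validation_mode(StructValidationMode::EachRow);
    client
        .query("SELECT map('a', 1, 'b', 2) :: Map(String, UInt32) AS counters")
        .fetch_all::<Stats>()
        .await
}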

Likely possible to implement:

  • Support for "shuffled" structure definitions, where the order of the fields does not match the DB, but the names and types are correct; it should be possible by leveraging (perhaps optionally) the visit_map API available for deserialize_struct instead of the current visit_seq, which processes a struct as a tuple.

Source files to look at:

Current benchmark results

Select numbers

This branch:

compress  validation  elapsed  throughput  received
    none   FirstN(1)   1.474s  2587 MiB/s  3815 MiB
    none        Each   3.053s  1250 MiB/s  3815 MiB

Main branch:

compress  elapsed  throughput  received
    none   1.296s  2943 MiB/s  3815 MiB

Still losing a bit when validating only the first record. Each row validation mode, as expected, is significantly slower. But the NYC taxi data (a more real-world scenario, since no one streams system.numbers, I guess...) shows totally different and very promising results.

NYC taxi data

This branch:

compress  validation    elapsed  throughput  received
    none   FirstN(1)  939.352ms   361 MiB/s   339 MiB
     lz4   FirstN(1)  950.834ms   357 MiB/s   151 MiB
    none        Each  988.465ms   343 MiB/s   339 MiB
     lz4        Each     1.186s   286 MiB/s   151 MiB

Main branch:

compress    elapsed  throughput  received
    none  939.392ms   361 MiB/s   339 MiB
     lz4  943.551ms   360 MiB/s   151 MiB

The difference is, in fact, not that great, especially considering the benefits the Each validation mode provides. Perhaps it is not a bad idea to use Each as the default mode instead of First(1)?

Issues overview

Note

If an issue is checked in the list, that means there is also a test that demonstrates proper error messages in case of schema mismatch.

Resolved issues

Related issues

Previously closed issues with unclear error messages

Follow-up issues

@mshustov mshustov requested review from Copilot and loyd May 20, 2025 12:36

@slvrtrn slvrtrn (Contributor, Author) left a comment

Added a few comments regarding the intermediate implementation.

}
let result = String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string();
Ok(result)
}
slvrtrn (Contributor, Author):

More or less the same as the implementation in the deserializer. Perhaps, as a follow-up, all the reader logic can be extracted into similar functions with #[inline(always)]?
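
A hedged sketch of what such an extracted helper might look like; the name and the Buf-based signature are illustrative, not taken from this PR:

use bytes::Buf;

// Hypothetical helper mirroring the snippet above: read `length` bytes and
// convert them to a String, replacing invalid UTF-8 sequences.
#[inline(always)]
fn read_utf8_string(buffer: &mut impl Buf, length: usize) -> String {
    String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string()
}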


#[error("Type parsing error: {0}")]
TypeParsingError(String),
}
slvrtrn (Contributor, Author):

Needs revising.

0 => visitor.visit_some(&mut RowBinaryDeserializer {
    input: self.input,
    validator: inner_data_type_validator,
}),
1 => visitor.visit_none(),
slvrtrn (Contributor, Author):

This is currently the main drawback of the validation implementation if we want to disable it after the first N rows for better performance. If these first rows are all NULLs, then we do not properly validate the inner type.
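
A hypothetical illustration of that gap (not a test from this PR; client is assumed to be an already configured Client using the default first-row validation):

#[derive(Debug, Row, Deserialize)]
struct Data {
    // The Rust side expects an Int64, but the column below is Nullable(UInt32).
    // Since the first produced row is NULL, only the null-flag byte is read for it,
    // so first-row-only validation never sees the inner UInt32 and the mismatch
    // goes undetected.
    n: Option<i64>,
}

let _rows = client
    .query("SELECT arrayJoin([NULL, 42]) :: Nullable(UInt32) AS n")
    .fetch_all::<Data>()
    .await;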

@slvrtrn slvrtrn changed the title from "PoC (Query): RowBinaryWithNamesAndTypes for enhanced type safety" to "feat (query): RowBinaryWithNamesAndTypes for enhanced type safety" on May 29, 2025
@slvrtrn slvrtrn changed the title from "feat (query): RowBinaryWithNamesAndTypes for enhanced type safety" to "feat(query): RowBinaryWithNamesAndTypes for enhanced type safety" on May 29, 2025
return Ok(());
}
Ok(_) => {
// TODO: or panic instead?
Member:

Returning an error when we're already returning a Result seems correct.

@slvrtrn slvrtrn (Contributor, Author) May 30, 2025:

It does not make sense to handle this error and continue if we cannot even parse the column header with names and types. That means everything has gone entirely wrong, and it should be unreachable... unless there are some odd network/LB issues?

Member:

It still seems wrong to panic over bytes controlled by another actor.

assert_eq!(actual, sample());
}
}
// #[test]
Member:

intend to restore this?

slvrtrn (Contributor, Author):

Will restore, but not sure if we need this test with so many integration tests that do essentially the same thing.


shift += 7;
if shift > 57 {
// TODO: what about another error?
Member:

what's the rationale behind 57?


.expect("failed to fetch string");
assert_eq!(result, "\x01\x02\x03\\ \"\'");
//
// let result = client
Member:

intended to restore?

slvrtrn (Contributor, Author):

Thanks for noticing. Don't know why it was commented out.

@slvrtrn slvrtrn requested a review from Copilot May 30, 2025 10:02
@slvrtrn slvrtrn marked this pull request as ready for review May 30, 2025 10:02
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces support for the RowBinaryWithNamesAndTypes (RBWNAT) format for enhanced type safety in query deserialization, along with new validation modes and improvements in error messages and benchmarks. Key changes include:

  • Adding new macros and tests to assert panic conditions on schema mismatches.
  • Refactoring query execution to use RBWNAT and propagating a client-wide validation mode.
  • Enhancements to serialization/deserialization, including a new utility for LEB128 encoding (see the sketch below) and improved columns header parsing.
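
For reference, LEB128 here is the standard unsigned varint encoding ClickHouse uses for lengths and counts. A minimal sketch of such a writer (the actual put_leb128 in this PR may differ in name and signature):

use bytes::BufMut;

// Write `value` as an unsigned LEB128 varint: 7 bits per byte,
// with the high bit set on every byte except the last.
fn put_leb128(buf: &mut impl BufMut, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            buf.put_u8(byte);
            break;
        }
        buf.put_u8(byte | 0x80);
    }
}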

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.

File                      Description
tests/it/main.rs          Added macros to assert panics during fetch operations.
tests/it/insert.rs        Updated table creation and row type in the rename insert test.
tests/it/cursor_stats.rs  Adjusted expected decoded byte count to account for the RBWNAT header.
tests/it/cursor_error.rs  Revised error handling and test scenarios for query timeouts.
src/validation_mode.rs    Introduced new validation mode enum with documentation.
src/rowbinary/ser.rs      Switched to using put_leb128 and replaced an error return with a panic.
src/cursors/row.rs        Implemented async header reading and conditionally validated rows.
examples/mock.rs          Updated the mock provide handler to include schema information.
benches/select_*.rs       Updated benchmarks to pass the validation mode to the client.
Cargo.toml                Updated bench configuration and dependency versions.
Comments suppressed due to low confidence (1)

src/rowbinary/tests.rs:117

  • The test 'it_deserializes' is commented out, which may reduce test coverage for deserialization; please re-enable or provide context for the deactivation.
// #[test]
// fn it_deserializes() { ... }

return Err(Error::VariantDiscriminatorIsOutOfBound(
    variant_index as usize,
));
panic!("max number of types in the Variant data type is 255, got {variant_index}")
Copilot AI May 30, 2025:

Instead of panicking when the variant index exceeds 255, consider returning a proper error to allow for graceful error handling.

slvrtrn (Contributor, Author):

It does not make sense to handle this error; it means the entire deserialization went wrong, because the server ensures at most 255 variants per type.


ilidemi commented May 30, 2025

Some comments from Claude Code (could be complete slop, I know little about Rust):

General

1. Panic-Driven Error Handling is Unacceptable
The validation system at src/rowbinary/validation.rs:68-82 uses panic! for schema mismatches. This is fundamentally wrong in a library:

  • Library rule #1: Never panic on user input
  • Creates unrecoverable failures for recoverable errors
  • Violates Rust's error handling principles
  • Makes debugging extremely difficult in production

Recommendation: Convert all panic! calls to proper Result<T, ValidationError> returns.

2. API Breaking Changes Without Semver

  • Query::fetch() now uses RowBinaryWithNamesAndTypes instead of RowBinary (line 90 in query.rs)
  • RowCursor::new() signature changed to require ValidationMode
  • This changes the wire protocol - a breaking change disguised as a feature addition

Performance optimizations

🔥 High-Impact Optimizations

1. Eliminate Allocations in Error Paths

Location: src/rowbinary/validation.rs:46-51

// Current: Allocates strings in panic paths
format!("{}.{}", self.get_struct_name(), c.name)
"Struct".to_string()

// Better: Use Cow<str> for zero-allocation error messages
use std::borrow::Cow;
fn get_current_column_name(&self) -> Cow<'static, str> {
    // avoid format! allocation
}

2. Optimize String Deserialization

Location: src/rowbinary/de.rs:67, 184

// Current: Always allocates Vec for String
fn read_vec(&mut self, size: usize) -> Result<Vec<u8>> {
    Ok(self.read_slice(size)?.to_vec())  // ❌ Always allocates
}

// Better: Only allocate when necessary
fn deserialize_string<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {
    let slice = self.read_slice(size)?;
    match str::from_utf8(slice) {
        Ok(s) => visitor.visit_borrowed_str(s),  // Zero-copy!
        Err(_) => {
            let string = String::from_utf8_lossy(slice).into_owned();
            visitor.visit_string(string)  // Only allocate for invalid UTF-8
        }
    }
}

3. Branch Prediction Optimization in Validation

Location: src/cursors/row.rs:96-104

// Current: Pattern matching on validation count
let (result, not_enough_data) = match self.rows_to_validate {
    0 => rowbinary::deserialize_from::<T>(&mut slice, &[]),
    u64::MAX => rowbinary::deserialize_from::<T>(&mut slice, &self.columns),
    _ => { /* ... */ }
};

// Better: Likely/unlikely hints for branch predictor
let (result, not_enough_data) = if likely(self.rows_to_validate > 0) {
    if self.rows_to_validate == u64::MAX {
        rowbinary::deserialize_from::<T>(&mut slice, &self.columns)
    } else {
        self.rows_to_validate -= 1;
        rowbinary::deserialize_from::<T>(&mut slice, &self.columns)
    }
} else {
    rowbinary::deserialize_from::<T>(&mut slice, &[])
};

🎯 Medium-Impact Optimizations

4. Validation State Caching

Location: src/rowbinary/validation.rs:87-90

// Current: Validates every field access
self.validator.validate(serde_type)?;

// Better: Cache validation results for repeated patterns
struct CachedValidator {
    last_column_idx: usize,
    last_validation: Option<ValidatedState>,
}

5. SIMD-Optimized Size Checks

Location: src/rowbinary/de.rs:89

// Current: Individual size checks
ensure_size(&mut self.input, core::mem::size_of::<$ty>())?;

// Better: Batch size checks for multiple fields
fn ensure_sizes_batch(input: &[u8], sizes: &[usize]) -> Result<()> {
    // SIMD-optimized batch boundary checking
}

6. Avoid Repeated Column Lookups

Location: src/rowbinary/validation.rs:42-52

// Current: String formatting on every error
format!("{}.{}", self.get_struct_name(), c.name)

// Better: Pre-format common error prefixes
struct ErrorContext {
    column_prefix: String,  // Pre-computed once per struct
}

@slvrtrn slvrtrn (Contributor, Author) commented May 30, 2025

@ilidemi, thanks. Here are some comments on that:

  1. Panic-Driven Error Handling is Unacceptable

It is more than acceptable here. It panics in case of an invalid struct definition in the code; there is no reason to continue (de)serializing junk. It is unsafe, and the Rust guidelines explicitly say the following about trying to continue with incorrect values:

The panic! macro signals that your program is in a state it can’t handle and lets you tell the process to stop instead of trying to proceed with invalid or incorrect values.

I'd say we have exactly this situation. See the explanation about Result:

The Result enum uses Rust’s type system to indicate that operations might fail in a way that your code could recover from.

It is not possible to recover a program that uses the crate from an incorrect struct definition. It must be fixed by the user.

  2. API Breaking Changes Without Semver
    Query::fetch() now uses RowBinaryWithNamesAndTypes instead of RowBinary (line 90 in query.rs)
    RowCursor::new() signature changed to require ValidationMode
    This changes the wire protocol - a breaking change disguised as a feature addition

Well, RBWNAT instead of RB is the intention. It will also go out as 0.14.0, where breaking changes are actually allowed. RowCursor is private to the end user (its visibility is pub(crate)), so it does not matter.

  3. Optimize String Deserialization

Worth looking into.

  4. Branch Prediction Optimization in Validation

Hints are unstable, and we cannot use them. But RowCursor became ~20% slower, that is a fact, and ideally we need to find a way to reduce the overhead; I haven't found one yet.

  5. Validation State Caching

It already validates only one array value and one key-value pair of a Map.

  6. SIMD-Optimized Size Checks

Worth checking indeed.

  7. Eliminate Allocations in Error Paths
  8. Avoid Repeated Column Lookups

There are no errors, only panics, so it does not really matter IMO.


ilidemi commented May 31, 2025

25% hit rate, definitely space for improvement 😊

Got a few more from Opus 4:

  1. On panic vs Result - database could be migrated from underneath the app (or all apps updated but one). The app wouldn't be able to access the data anyway, but there's a difference between failing one path and crashing the process.
  2. ensure_size could be #[inline(always)], although the compiler would likely figure it out
  3. Save on runtime branches for validation on every row by using static generics to distinguish fast path from slow path. Although it agreed that the branch predictor would likely catch on when they stop being taken many times in a row.
